requirements=c("summarytools", "pROC", "glmnetUtils", "dplyr", "car", "effects", "gridExtra", "grid", "MASS","e1071", "mgcv")
for (req in requirements){
if (!require(req, character.only = TRUE)){
install.packages(req)
}
}
shootings <- readRDS("data/shootings_known.Rda")
print(dfSummary(shootings), method="render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | day_period [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | day_year [numeric] |
|
366 distinct values | 13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | year [numeric] |
|
17 distinct values | 13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | week_day [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | COVID_lockdown [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | COVID_pandemic [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | working_hour [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | Latitude [numeric] |
|
7051 distinct values | 13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | Longitude [numeric] |
|
7047 distinct values | 13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | city_location [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | perp_age [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | vic_age [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | perp_sex [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | vic_sex [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | perp_race [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | vic_race [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17 | other_victims [numeric] |
|
11 distinct values | 13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 18 | jurisdiction [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 19 | murder [factor] |
|
|
13873 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20 | murder_prob [numeric] |
|
|
13873 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19
set.seed(2)
train <- sample(nrow(shootings), 0.80*nrow(shootings))
test <- (-train)
shootings.train <- subset(shootings[train,], select = -murder )
shootings.test <- subset(shootings[test,], select = -murder )
dim(shootings.train)
## [1] 11098 19
dim(shootings.test)
## [1] 2775 19
Let’s start with a model with all the predictors.
glm.full <- glm(murder_prob ~ ., data=shootings.train, family=binomial)
summary(glm.full)
##
## Call:
## glm(formula = murder_prob ~ ., family = binomial, data = shootings.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.662e+02 1.002e+02 -1.658 0.09731 .
## day_periodMorning -2.322e-01 1.385e-01 -1.676 0.09371 .
## day_periodEarlyAfternoon -3.859e-01 1.292e-01 -2.988 0.00281 **
## day_periodAfternoon -4.484e-01 1.057e-01 -4.243 2.20e-05 ***
## day_periodEvening -4.227e-01 1.025e-01 -4.126 3.70e-05 ***
## day_periodNight -4.198e-01 9.579e-02 -4.382 1.18e-05 ***
## day_year 1.228e-04 2.354e-04 0.521 0.60209
## year -9.052e-03 6.417e-03 -1.411 0.15833
## week_dayTuesday -3.710e-02 9.097e-02 -0.408 0.68343
## week_dayWednesday 9.868e-02 8.966e-02 1.101 0.27105
## week_dayThursday -1.424e-02 9.026e-02 -0.158 0.87468
## week_dayFriday 5.631e-02 8.695e-02 0.648 0.51721
## week_daySaturday -1.261e-01 8.511e-02 -1.481 0.13856
## week_daySunday -2.656e-02 8.442e-02 -0.315 0.75301
## COVID_lockdownTRUE -8.553e-02 1.308e-01 -0.654 0.51307
## COVID_pandemicTRUE 7.755e-02 8.769e-02 0.884 0.37654
## working_hourTRUE -5.431e-02 8.995e-02 -0.604 0.54599
## Latitude 1.967e+00 1.037e+00 1.897 0.05783 .
## Longitude -1.398e+00 1.129e+00 -1.238 0.21567
## city_locationN_Manhattan -2.087e-01 1.730e-01 -1.207 0.22751
## city_locationW_Bronx 2.593e-02 1.914e-01 0.135 0.89223
## city_locationE_Bronx -2.007e-02 2.379e-01 -0.084 0.93277
## city_locationN_E_Queens 2.618e-01 2.869e-01 0.912 0.36151
## city_locationN_W_Queens 8.406e-02 2.001e-01 0.420 0.67442
## city_locationS_Queens 4.580e-01 2.956e-01 1.550 0.12125
## city_locationN_Brooklyn 1.269e-01 1.609e-01 0.789 0.43026
## city_locationS_E_Brooklyn 2.283e-01 1.919e-01 1.190 0.23416
## city_locationS_W_Brooklyn 2.347e-01 2.041e-01 1.150 0.25015
## city_locationW_Staten_Island -4.238e-01 3.837e-01 -1.105 0.26934
## city_locationE_Staten_Island -2.363e-02 2.413e-01 -0.098 0.92199
## perp_age18-24 1.351e-01 8.353e-02 1.618 0.10575
## perp_age25-44 4.083e-01 8.540e-02 4.781 1.74e-06 ***
## perp_age45+ 6.788e-01 1.270e-01 5.343 9.16e-08 ***
## vic_age18-24 2.752e-01 8.688e-02 3.168 0.00154 **
## vic_age25-44 3.470e-01 8.680e-02 3.998 6.40e-05 ***
## vic_age45+ 3.084e-01 1.121e-01 2.751 0.00595 **
## perp_sexM -4.223e-02 1.278e-01 -0.330 0.74113
## vic_sexM -2.028e-02 7.069e-02 -0.287 0.77423
## perp_raceBLACK -5.226e-01 1.340e-01 -3.900 9.62e-05 ***
## perp_raceBLACK HISPANIC -6.411e-01 1.535e-01 -4.175 2.97e-05 ***
## perp_raceWHITE HISPANIC -4.881e-01 1.413e-01 -3.455 0.00055 ***
## vic_raceBLACK -7.186e-02 1.113e-01 -0.646 0.51853
## vic_raceBLACK HISPANIC -3.365e-01 1.310e-01 -2.570 0.01017 *
## vic_raceWHITE HISPANIC -9.342e-02 1.191e-01 -0.784 0.43287
## other_victims 1.347e-01 9.960e-03 13.520 < 2e-16 ***
## jurisdictionHOUSING -1.909e-01 6.724e-02 -2.839 0.00453 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12203 on 11097 degrees of freedom
## Residual deviance: 11796 on 11052 degrees of freedom
## AIC: 11888
##
## Number of Fisher Scoring iterations: 4
As we can see:
the day period is strongly significant for the victim survival.
the day of the year, the year, the day of the week, the COVID periods and the working hour are not significant.
for some reason Latitude is slightly significant but Longitude is not.
the city location is not significant at all.
the perpetrator age is significant.
the victim age is significant as expected.
sex predictors for both perpetrator and victim are not significant.
perpetrator race predictors is significant, while victim race is slightly significant.
the presence of addiction victims is strongly significant.
the jurisdiction is strongly significant.
First check multicollinearity:
vif(glm.full)
## GVIF Df GVIF^(1/(2*Df))
## day_period 2.325583 5 1.088061
## day_year 1.012866 1 1.006412
## year 2.214926 1 1.488263
## week_day 1.308980 6 1.022691
## COVID_lockdown 1.205156 1 1.097796
## COVID_pandemic 2.412109 1 1.553097
## working_hour 2.352406 1 1.533756
## Latitude 16.666168 1 4.082422
## Longitude 12.048878 1 3.471149
## city_location 223.957441 11 1.278868
## perp_age 1.263237 3 1.039715
## vic_age 1.271971 3 1.040909
## perp_sex 1.011028 1 1.005499
## vic_sex 1.050139 1 1.024763
## perp_race 1.663579 3 1.088530
## vic_race 1.670686 3 1.089304
## other_victims 1.068994 1 1.033922
## jurisdiction 1.060846 1 1.029974
As we can see city_location predictors has an extremely
large VIF and it is also not significant, we should remove it.
glm.full <- update(glm.full, .~.-city_location)
vif(glm.full)
## GVIF Df GVIF^(1/(2*Df))
## day_period 2.281159 5 1.085964
## day_year 1.010534 1 1.005253
## year 2.199714 1 1.483143
## week_day 1.294785 6 1.021762
## COVID_lockdown 1.200298 1 1.095581
## COVID_pandemic 2.396762 1 1.548148
## working_hour 2.347309 1 1.532093
## Latitude 1.167251 1 1.080394
## Longitude 1.069447 1 1.034141
## perp_age 1.255338 3 1.038628
## vic_age 1.264817 3 1.039931
## perp_sex 1.008670 1 1.004326
## vic_sex 1.048185 1 1.023809
## perp_race 1.600571 3 1.081548
## vic_race 1.624580 3 1.084235
## other_victims 1.064209 1 1.031605
## jurisdiction 1.029270 1 1.014530
Now we check for influential points:
influenceIndexPlot(glm.full, vars = "C")
The cook’s distance plot does not give indication of influential points.
Now let’s create a model with only significant predictors:
glm.sig <- glm(murder_prob ~ day_period + perp_age + vic_age + perp_race+ vic_race + other_victims + jurisdiction, data=shootings.train, family=binomial)
summary(glm.sig)
##
## Call:
## glm(formula = murder_prob ~ day_period + perp_age + vic_age +
## perp_race + vic_race + other_victims + jurisdiction, family = binomial,
## data = shootings.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.879412 0.183232 -4.799 1.59e-06 ***
## day_periodMorning -0.248163 0.123962 -2.002 0.045293 *
## day_periodEarlyAfternoon -0.402931 0.113475 -3.551 0.000384 ***
## day_periodAfternoon -0.436644 0.101897 -4.285 1.83e-05 ***
## day_periodEvening -0.390479 0.101079 -3.863 0.000112 ***
## day_periodNight -0.400965 0.095378 -4.204 2.62e-05 ***
## perp_age18-24 0.132067 0.083237 1.587 0.112592
## perp_age25-44 0.404154 0.085004 4.755 1.99e-06 ***
## perp_age45+ 0.673634 0.126180 5.339 9.36e-08 ***
## vic_age18-24 0.270305 0.086313 3.132 0.001738 **
## vic_age25-44 0.330290 0.085939 3.843 0.000121 ***
## vic_age45+ 0.296612 0.111439 2.662 0.007776 **
## perp_raceBLACK -0.525160 0.132352 -3.968 7.25e-05 ***
## perp_raceBLACK HISPANIC -0.616717 0.151154 -4.080 4.50e-05 ***
## perp_raceWHITE HISPANIC -0.466753 0.139693 -3.341 0.000834 ***
## vic_raceBLACK -0.067016 0.109650 -0.611 0.541079
## vic_raceBLACK HISPANIC -0.311194 0.128657 -2.419 0.015573 *
## vic_raceWHITE HISPANIC -0.072053 0.117215 -0.615 0.538751
## other_victims 0.135840 0.009802 13.859 < 2e-16 ***
## jurisdictionHOUSING -0.206270 0.065742 -3.138 0.001703 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12203 on 11097 degrees of freedom
## Residual deviance: 11823 on 11078 degrees of freedom
## AIC: 11863
##
## Number of Fisher Scoring iterations: 4
AIC(glm.sig,glm.full)
## df AIC
## glm.sig 20 11863.08
## glm.full 35 11877.18
The AIC indicates that glm.sig is better than the full
model. Also the estimate coefficients are more interpretable.
Now we check for influential points:
influenceIndexPlot(glm.sig, vars = "C")
The cook’s distance plot does not give indication of influential points.
Now let’s take a look at some interactions between victim and
perpetrator. In particular we add to glm.sig the following
interactions:
glm.sign.inter <- update(glm.sig, . ~ . + perp_age:vic_age + perp_race:vic_race)
summary(glm.sign.inter)
##
## Call:
## glm(formula = murder_prob ~ day_period + perp_age + vic_age +
## perp_race + vic_race + other_victims + jurisdiction + perp_age:vic_age +
## perp_race:vic_race, family = binomial, data = shootings.train)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.613509 0.228981 -2.679
## day_periodMorning -0.249730 0.124689 -2.003
## day_periodEarlyAfternoon -0.403912 0.114039 -3.542
## day_periodAfternoon -0.439770 0.102440 -4.293
## day_periodEvening -0.399003 0.101668 -3.925
## day_periodNight -0.409595 0.095936 -4.269
## perp_age18-24 0.208068 0.178197 1.168
## perp_age25-44 0.379735 0.221629 1.713
## perp_age45+ 0.691423 0.602813 1.147
## vic_age18-24 0.197264 0.186343 1.059
## vic_age25-44 0.575041 0.197407 2.913
## vic_age45+ 0.274616 0.353817 0.776
## perp_raceBLACK -1.029365 0.217937 -4.723
## perp_raceBLACK HISPANIC -1.624632 0.508655 -3.194
## perp_raceWHITE HISPANIC -0.824480 0.270888 -3.044
## vic_raceBLACK -0.984678 0.327303 -3.008
## vic_raceBLACK HISPANIC -1.261441 0.524508 -2.405
## vic_raceWHITE HISPANIC -0.575297 0.323062 -1.781
## other_victims 0.137799 0.009864 13.970
## jurisdictionHOUSING -0.212252 0.065955 -3.218
## perp_age18-24:vic_age18-24 0.044046 0.221089 0.199
## perp_age25-44:vic_age18-24 0.163322 0.260440 0.627
## perp_age45+:vic_age18-24 0.137680 0.678234 0.203
## perp_age18-24:vic_age25-44 -0.273515 0.231128 -1.183
## perp_age25-44:vic_age25-44 -0.179458 0.263781 -0.680
## perp_age45+:vic_age25-44 -0.389557 0.632589 -0.616
## perp_age18-24:vic_age45+ -0.180258 0.398888 -0.452
## perp_age25-44:vic_age45+ 0.083334 0.408223 0.204
## perp_age45+:vic_age45+ 0.237229 0.702978 0.337
## perp_raceBLACK:vic_raceBLACK 1.122202 0.363885 3.084
## perp_raceBLACK HISPANIC:vic_raceBLACK 1.638309 0.597946 2.740
## perp_raceWHITE HISPANIC:vic_raceBLACK 1.013128 0.408489 2.480
## perp_raceBLACK:vic_raceBLACK HISPANIC 1.202486 0.557728 2.156
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 1.647694 0.733303 2.247
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.974565 0.587972 1.658
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.713939 0.367815 1.941
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 1.265044 0.599812 2.109
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.543712 0.402118 1.352
## Pr(>|z|)
## (Intercept) 0.007378 **
## day_periodMorning 0.045196 *
## day_periodEarlyAfternoon 0.000397 ***
## day_periodAfternoon 1.76e-05 ***
## day_periodEvening 8.69e-05 ***
## day_periodNight 1.96e-05 ***
## perp_age18-24 0.242955
## perp_age25-44 0.086642 .
## perp_age45+ 0.251384
## vic_age18-24 0.289778
## vic_age25-44 0.003580 **
## vic_age45+ 0.437660
## perp_raceBLACK 2.32e-06 ***
## perp_raceBLACK HISPANIC 0.001403 **
## perp_raceWHITE HISPANIC 0.002338 **
## vic_raceBLACK 0.002626 **
## vic_raceBLACK HISPANIC 0.016173 *
## vic_raceWHITE HISPANIC 0.074951 .
## other_victims < 2e-16 ***
## jurisdictionHOUSING 0.001290 **
## perp_age18-24:vic_age18-24 0.842090
## perp_age25-44:vic_age18-24 0.530595
## perp_age45+:vic_age18-24 0.839137
## perp_age18-24:vic_age25-44 0.236655
## perp_age25-44:vic_age25-44 0.496297
## perp_age45+:vic_age25-44 0.538018
## perp_age18-24:vic_age45+ 0.651340
## perp_age25-44:vic_age45+ 0.838246
## perp_age45+:vic_age45+ 0.735768
## perp_raceBLACK:vic_raceBLACK 0.002043 **
## perp_raceBLACK HISPANIC:vic_raceBLACK 0.006146 **
## perp_raceWHITE HISPANIC:vic_raceBLACK 0.013132 *
## perp_raceBLACK:vic_raceBLACK HISPANIC 0.031080 *
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.024643 *
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.097418 .
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.052255 .
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.034939 *
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.176337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12203 on 11097 degrees of freedom
## Residual deviance: 11799 on 11060 degrees of freedom
## AIC: 11875
##
## Number of Fisher Scoring iterations: 4
As we can see the interaction between perpetrator race and victims race is slightly significant, while between perpetrator age and victim age is not. Let’s remove it.
glm.sign.inter <- update(glm.sign.inter, . ~ . - perp_age:vic_age)
summary(glm.sign.inter)
##
## Call:
## glm(formula = murder_prob ~ day_period + perp_age + vic_age +
## perp_race + vic_race + other_victims + jurisdiction + perp_race:vic_race,
## family = binomial, data = shootings.train)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.559980 0.202144 -2.770
## day_periodMorning -0.258822 0.124399 -2.081
## day_periodEarlyAfternoon -0.410170 0.113749 -3.606
## day_periodAfternoon -0.444091 0.102223 -4.344
## day_periodEvening -0.404341 0.101427 -3.987
## day_periodNight -0.412883 0.095703 -4.314
## perp_age18-24 0.131995 0.083283 1.585
## perp_age25-44 0.406821 0.085074 4.782
## perp_age45+ 0.659310 0.126704 5.204
## vic_age18-24 0.276794 0.086405 3.203
## vic_age25-44 0.339827 0.086071 3.948
## vic_age45+ 0.298527 0.111728 2.672
## perp_raceBLACK -1.052892 0.217318 -4.845
## perp_raceBLACK HISPANIC -1.624995 0.508221 -3.197
## perp_raceWHITE HISPANIC -0.836254 0.269996 -3.097
## vic_raceBLACK -0.995607 0.326765 -3.047
## vic_raceBLACK HISPANIC -1.286057 0.524276 -2.453
## vic_raceWHITE HISPANIC -0.578081 0.322142 -1.794
## other_victims 0.136781 0.009828 13.918
## jurisdictionHOUSING -0.208539 0.065880 -3.165
## perp_raceBLACK:vic_raceBLACK 1.141862 0.363324 3.143
## perp_raceBLACK HISPANIC:vic_raceBLACK 1.629987 0.597472 2.728
## perp_raceWHITE HISPANIC:vic_raceBLACK 1.020911 0.407687 2.504
## perp_raceBLACK:vic_raceBLACK HISPANIC 1.236925 0.557397 2.219
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 1.667519 0.732951 2.275
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.996136 0.587541 1.695
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.723782 0.366942 1.972
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 1.254745 0.599077 2.094
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.544767 0.401143 1.358
## Pr(>|z|)
## (Intercept) 0.005602 **
## day_periodMorning 0.037472 *
## day_periodEarlyAfternoon 0.000311 ***
## day_periodAfternoon 1.40e-05 ***
## day_periodEvening 6.71e-05 ***
## day_periodNight 1.60e-05 ***
## perp_age18-24 0.112991
## perp_age25-44 1.74e-06 ***
## perp_age45+ 1.95e-07 ***
## vic_age18-24 0.001358 **
## vic_age25-44 7.87e-05 ***
## vic_age45+ 0.007542 **
## perp_raceBLACK 1.27e-06 ***
## perp_raceBLACK HISPANIC 0.001387 **
## perp_raceWHITE HISPANIC 0.001953 **
## vic_raceBLACK 0.002312 **
## vic_raceBLACK HISPANIC 0.014167 *
## vic_raceWHITE HISPANIC 0.072735 .
## other_victims < 2e-16 ***
## jurisdictionHOUSING 0.001548 **
## perp_raceBLACK:vic_raceBLACK 0.001673 **
## perp_raceBLACK HISPANIC:vic_raceBLACK 0.006369 **
## perp_raceWHITE HISPANIC:vic_raceBLACK 0.012274 *
## perp_raceBLACK:vic_raceBLACK HISPANIC 0.026479 *
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.022901 *
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.089993 .
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.048556 *
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.036219 *
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.174452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12203 on 11097 degrees of freedom
## Residual deviance: 11806 on 11069 degrees of freedom
## AIC: 11864
##
## Number of Fisher Scoring iterations: 4
AIC(glm.sign.inter, glm.sig)
## df AIC
## glm.sign.inter 29 11864.46
## glm.sig 20 11863.08
Since the coefficient of the interaction are significant and the AIC increase is negligible (only one unit) we prefer the model with interactions.
Now let’s try adding interaction terms to the full model. Now let’s add the following interactions:
glm.full.inter <- update(glm.full, . ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year)
summary(glm.full.inter)
##
## Call:
## glm(formula = murder_prob ~ day_period + day_year + year + week_day +
## COVID_lockdown + COVID_pandemic + working_hour + Latitude +
## Longitude + perp_age + vic_age + perp_sex + vic_sex + perp_race +
## vic_race + other_victims + jurisdiction + perp_race:vic_race +
## perp_age:vic_age + perp_sex:vic_sex + day_year:year, family = binomial,
## data = shootings.train)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -2.107e+00 3.625e+01 -0.058
## day_periodMorning -2.345e-01 1.391e-01 -1.686
## day_periodEarlyAfternoon -3.812e-01 1.297e-01 -2.938
## day_periodAfternoon -4.435e-01 1.061e-01 -4.182
## day_periodEvening -4.218e-01 1.029e-01 -4.100
## day_periodNight -4.167e-01 9.624e-02 -4.329
## day_year 1.139e-01 9.206e-02 1.237
## year 2.336e-03 1.065e-02 0.219
## week_dayTuesday -4.600e-02 9.113e-02 -0.505
## week_dayWednesday 1.046e-01 8.979e-02 1.164
## week_dayThursday -1.110e-02 9.044e-02 -0.123
## week_dayFriday 6.615e-02 8.704e-02 0.760
## week_daySaturday -1.219e-01 8.518e-02 -1.431
## week_daySunday -2.592e-02 8.444e-02 -0.307
## COVID_lockdownTRUE -7.062e-02 1.307e-01 -0.540
## COVID_pandemicTRUE 6.581e-02 8.763e-02 0.751
## working_hourTRUE -5.089e-02 9.003e-02 -0.565
## Latitude 4.827e-01 2.763e-01 1.747
## Longitude 3.145e-01 3.360e-01 0.936
## perp_age18-24 2.100e-01 1.784e-01 1.177
## perp_age25-44 3.740e-01 2.220e-01 1.684
## perp_age45+ 6.880e-01 6.042e-01 1.139
## vic_age18-24 2.052e-01 1.868e-01 1.098
## vic_age25-44 5.792e-01 1.979e-01 2.927
## vic_age45+ 2.470e-01 3.546e-01 0.696
## perp_sexM 3.946e-01 3.376e-01 1.169
## vic_sexM 4.819e-01 3.578e-01 1.347
## perp_raceBLACK -1.047e+00 2.195e-01 -4.770
## perp_raceBLACK HISPANIC -1.662e+00 5.100e-01 -3.258
## perp_raceWHITE HISPANIC -8.420e-01 2.721e-01 -3.094
## vic_raceBLACK -1.003e+00 3.287e-01 -3.050
## vic_raceBLACK HISPANIC -1.278e+00 5.253e-01 -2.432
## vic_raceWHITE HISPANIC -6.147e-01 3.252e-01 -1.890
## other_victims 1.369e-01 1.002e-02 13.653
## jurisdictionHOUSING -2.050e-01 6.641e-02 -3.086
## perp_raceBLACK:vic_raceBLACK 1.145e+00 3.655e-01 3.134
## perp_raceBLACK HISPANIC:vic_raceBLACK 1.652e+00 5.993e-01 2.757
## perp_raceWHITE HISPANIC:vic_raceBLACK 1.010e+00 4.097e-01 2.464
## perp_raceBLACK:vic_raceBLACK HISPANIC 1.204e+00 5.587e-01 2.156
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 1.645e+00 7.345e-01 2.239
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 9.734e-01 5.888e-01 1.653
## perp_raceBLACK:vic_raceWHITE HISPANIC 7.349e-01 3.692e-01 1.990
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 1.289e+00 6.014e-01 2.143
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 5.662e-01 4.037e-01 1.402
## perp_age18-24:vic_age18-24 4.108e-02 2.214e-01 0.186
## perp_age25-44:vic_age18-24 1.672e-01 2.609e-01 0.641
## perp_age45+:vic_age18-24 1.612e-01 6.800e-01 0.237
## perp_age18-24:vic_age25-44 -2.667e-01 2.315e-01 -1.152
## perp_age25-44:vic_age25-44 -1.653e-01 2.644e-01 -0.625
## perp_age45+:vic_age25-44 -3.765e-01 6.340e-01 -0.594
## perp_age18-24:vic_age45+ -1.657e-01 3.998e-01 -0.414
## perp_age25-44:vic_age45+ 1.290e-01 4.095e-01 0.315
## perp_age45+:vic_age45+ 2.889e-01 7.046e-01 0.410
## perp_sexM:vic_sexM -5.145e-01 3.650e-01 -1.409
## day_year:year -5.650e-05 4.572e-05 -1.236
## Pr(>|z|)
## (Intercept) 0.95366
## day_periodMorning 0.09185 .
## day_periodEarlyAfternoon 0.00330 **
## day_periodAfternoon 2.89e-05 ***
## day_periodEvening 4.13e-05 ***
## day_periodNight 1.49e-05 ***
## day_year 0.21606
## year 0.82631
## week_dayTuesday 0.61367
## week_dayWednesday 0.24422
## week_dayThursday 0.90233
## week_dayFriday 0.44725
## week_daySaturday 0.15239
## week_daySunday 0.75889
## COVID_lockdownTRUE 0.58899
## COVID_pandemicTRUE 0.45261
## working_hourTRUE 0.57192
## Latitude 0.08059 .
## Longitude 0.34931
## perp_age18-24 0.23935
## perp_age25-44 0.09210 .
## perp_age45+ 0.25478
## vic_age18-24 0.27214
## vic_age25-44 0.00342 **
## vic_age45+ 0.48612
## perp_sexM 0.24257
## vic_sexM 0.17809
## perp_raceBLACK 1.84e-06 ***
## perp_raceBLACK HISPANIC 0.00112 **
## perp_raceWHITE HISPANIC 0.00197 **
## vic_raceBLACK 0.00229 **
## vic_raceBLACK HISPANIC 0.01501 *
## vic_raceWHITE HISPANIC 0.05873 .
## other_victims < 2e-16 ***
## jurisdictionHOUSING 0.00203 **
## perp_raceBLACK:vic_raceBLACK 0.00172 **
## perp_raceBLACK HISPANIC:vic_raceBLACK 0.00583 **
## perp_raceWHITE HISPANIC:vic_raceBLACK 0.01372 *
## perp_raceBLACK:vic_raceBLACK HISPANIC 0.03110 *
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.02513 *
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.09830 .
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.04656 *
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.03213 *
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.16077
## perp_age18-24:vic_age18-24 0.85279
## perp_age25-44:vic_age18-24 0.52172
## perp_age45+:vic_age18-24 0.81259
## perp_age18-24:vic_age25-44 0.24922
## perp_age25-44:vic_age25-44 0.53197
## perp_age45+:vic_age25-44 0.55264
## perp_age18-24:vic_age45+ 0.67860
## perp_age25-44:vic_age45+ 0.75265
## perp_age45+:vic_age45+ 0.68180
## perp_sexM:vic_sexM 0.15874
## day_year:year 0.21657
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12203 on 11097 degrees of freedom
## Residual deviance: 11779 on 11043 degrees of freedom
## AIC: 11889
##
## Number of Fisher Scoring iterations: 4
As before the interaction between perpetrator race and victim race is significant, the interaction between perpetrator age and victim age is not significant. Furthermore the interaction between perpetrator sex and victim sex and the interaction between year and day of the year are not significant. Let’s remove the not significant interactions:
glm.full.inter <- update(glm.full.inter, . ~ . - perp_age:vic_age - perp_sex:vic_sex - year:day_year)
summary(glm.full.inter)
##
## Call:
## glm(formula = murder_prob ~ day_period + day_year + year + week_day +
## COVID_lockdown + COVID_pandemic + working_hour + Latitude +
## Longitude + perp_age + vic_age + perp_sex + vic_sex + perp_race +
## vic_race + other_victims + jurisdiction + perp_race:vic_race,
## family = binomial, data = shootings.train)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 20.7959594 31.8958533 0.652
## day_periodMorning -0.2456957 0.1387187 -1.771
## day_periodEarlyAfternoon -0.3911497 0.1293100 -3.025
## day_periodAfternoon -0.4521892 0.1057480 -4.276
## day_periodEvening -0.4292800 0.1025737 -4.185
## day_periodNight -0.4223721 0.0959247 -4.403
## day_year 0.0001051 0.0002355 0.446
## year -0.0085092 0.0064051 -1.329
## week_dayTuesday -0.0441346 0.0909882 -0.485
## week_dayWednesday 0.0981691 0.0896198 1.095
## week_dayThursday -0.0124163 0.0902670 -0.138
## week_dayFriday 0.0574862 0.0868911 0.662
## week_daySaturday -0.1250365 0.0850505 -1.470
## week_daySunday -0.0287566 0.0843183 -0.341
## COVID_lockdownTRUE -0.0798501 0.1305400 -0.612
## COVID_pandemicTRUE 0.0681454 0.0874755 0.779
## working_hourTRUE -0.0498608 0.0898993 -0.555
## Latitude 0.4716384 0.2759198 1.709
## Longitude 0.3163084 0.3355788 0.943
## perp_age18-24 0.1349025 0.0834756 1.616
## perp_age25-44 0.4110874 0.0853446 4.817
## perp_age45+ 0.6673513 0.1272876 5.243
## vic_age18-24 0.2796371 0.0868504 3.220
## vic_age25-44 0.3500667 0.0867783 4.034
## vic_age45+ 0.3038834 0.1121759 2.709
## perp_sexM -0.0375984 0.1280062 -0.294
## vic_sexM -0.0147228 0.0707723 -0.208
## perp_raceBLACK -1.0615720 0.2187197 -4.854
## perp_raceBLACK HISPANIC -1.6440918 0.5094168 -3.227
## perp_raceWHITE HISPANIC -0.8465790 0.2711553 -3.122
## vic_raceBLACK -1.0061478 0.3279735 -3.068
## vic_raceBLACK HISPANIC -1.2903632 0.5251927 -2.457
## vic_raceWHITE HISPANIC -0.5948548 0.3235285 -1.839
## other_victims 0.1368425 0.0099637 13.734
## jurisdictionHOUSING -0.2017840 0.0663307 -3.042
## perp_raceBLACK:vic_raceBLACK 1.1495145 0.3646584 3.152
## perp_raceBLACK HISPANIC:vic_raceBLACK 1.6223876 0.5986164 2.710
## perp_raceWHITE HISPANIC:vic_raceBLACK 1.0030521 0.4086859 2.454
## perp_raceBLACK:vic_raceBLACK HISPANIC 1.2198175 0.5584145 2.184
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 1.6372517 0.7340633 2.230
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.9769613 0.5884792 1.660
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.7157150 0.3676400 1.947
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 1.2428665 0.6000116 2.071
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.5385956 0.4020950 1.339
## Pr(>|z|)
## (Intercept) 0.51440
## day_periodMorning 0.07653 .
## day_periodEarlyAfternoon 0.00249 **
## day_periodAfternoon 1.90e-05 ***
## day_periodEvening 2.85e-05 ***
## day_periodNight 1.07e-05 ***
## day_year 0.65546
## year 0.18401
## week_dayTuesday 0.62763
## week_dayWednesday 0.27334
## week_dayThursday 0.89060
## week_dayFriday 0.50823
## week_daySaturday 0.14152
## week_daySunday 0.73307
## COVID_lockdownTRUE 0.54074
## COVID_pandemicTRUE 0.43597
## working_hourTRUE 0.57915
## Latitude 0.08739 .
## Longitude 0.34590
## perp_age18-24 0.10608
## perp_age25-44 1.46e-06 ***
## perp_age45+ 1.58e-07 ***
## vic_age18-24 0.00128 **
## vic_age25-44 5.48e-05 ***
## vic_age45+ 0.00675 **
## perp_sexM 0.76897
## vic_sexM 0.83521
## perp_raceBLACK 1.21e-06 ***
## perp_raceBLACK HISPANIC 0.00125 **
## perp_raceWHITE HISPANIC 0.00180 **
## vic_raceBLACK 0.00216 **
## vic_raceBLACK HISPANIC 0.01401 *
## vic_raceWHITE HISPANIC 0.06597 .
## other_victims < 2e-16 ***
## jurisdictionHOUSING 0.00235 **
## perp_raceBLACK:vic_raceBLACK 0.00162 **
## perp_raceBLACK HISPANIC:vic_raceBLACK 0.00672 **
## perp_raceWHITE HISPANIC:vic_raceBLACK 0.01411 *
## perp_raceBLACK:vic_raceBLACK HISPANIC 0.02893 *
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.02572 *
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.09689 .
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.05156 .
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.03832 *
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.18042
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12203 on 11097 degrees of freedom
## Residual deviance: 11791 on 11054 degrees of freedom
## AIC: 11879
##
## Number of Fisher Scoring iterations: 4
Check AIC:
AIC(glm.full)
## [1] 11877.18
AIC(glm.full.inter)
## [1] 11878.91
As before the AIC increase is negligible and the interaction between perpetrator and victim races is significant, thus we prefer this model compared to the full model.
In summary the models with interaction are preferable. Now let’s compare the two models with interaction:
AIC(glm.full.inter)
## [1] 11878.91
AIC(glm.sign.inter)
## [1] 11864.46
In terms of AIC we prefer glm.sign.inter: the model with
only significant predictors and the interaction term.
Let’s interpret what the best model in terms of AIC tell us.
summary(glm.sign.inter)
##
## Call:
## glm(formula = murder_prob ~ day_period + perp_age + vic_age +
## perp_race + vic_race + other_victims + jurisdiction + perp_race:vic_race,
## family = binomial, data = shootings.train)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.559980 0.202144 -2.770
## day_periodMorning -0.258822 0.124399 -2.081
## day_periodEarlyAfternoon -0.410170 0.113749 -3.606
## day_periodAfternoon -0.444091 0.102223 -4.344
## day_periodEvening -0.404341 0.101427 -3.987
## day_periodNight -0.412883 0.095703 -4.314
## perp_age18-24 0.131995 0.083283 1.585
## perp_age25-44 0.406821 0.085074 4.782
## perp_age45+ 0.659310 0.126704 5.204
## vic_age18-24 0.276794 0.086405 3.203
## vic_age25-44 0.339827 0.086071 3.948
## vic_age45+ 0.298527 0.111728 2.672
## perp_raceBLACK -1.052892 0.217318 -4.845
## perp_raceBLACK HISPANIC -1.624995 0.508221 -3.197
## perp_raceWHITE HISPANIC -0.836254 0.269996 -3.097
## vic_raceBLACK -0.995607 0.326765 -3.047
## vic_raceBLACK HISPANIC -1.286057 0.524276 -2.453
## vic_raceWHITE HISPANIC -0.578081 0.322142 -1.794
## other_victims 0.136781 0.009828 13.918
## jurisdictionHOUSING -0.208539 0.065880 -3.165
## perp_raceBLACK:vic_raceBLACK 1.141862 0.363324 3.143
## perp_raceBLACK HISPANIC:vic_raceBLACK 1.629987 0.597472 2.728
## perp_raceWHITE HISPANIC:vic_raceBLACK 1.020911 0.407687 2.504
## perp_raceBLACK:vic_raceBLACK HISPANIC 1.236925 0.557397 2.219
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 1.667519 0.732951 2.275
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.996136 0.587541 1.695
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.723782 0.366942 1.972
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 1.254745 0.599077 2.094
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.544767 0.401143 1.358
## Pr(>|z|)
## (Intercept) 0.005602 **
## day_periodMorning 0.037472 *
## day_periodEarlyAfternoon 0.000311 ***
## day_periodAfternoon 1.40e-05 ***
## day_periodEvening 6.71e-05 ***
## day_periodNight 1.60e-05 ***
## perp_age18-24 0.112991
## perp_age25-44 1.74e-06 ***
## perp_age45+ 1.95e-07 ***
## vic_age18-24 0.001358 **
## vic_age25-44 7.87e-05 ***
## vic_age45+ 0.007542 **
## perp_raceBLACK 1.27e-06 ***
## perp_raceBLACK HISPANIC 0.001387 **
## perp_raceWHITE HISPANIC 0.001953 **
## vic_raceBLACK 0.002312 **
## vic_raceBLACK HISPANIC 0.014167 *
## vic_raceWHITE HISPANIC 0.072735 .
## other_victims < 2e-16 ***
## jurisdictionHOUSING 0.001548 **
## perp_raceBLACK:vic_raceBLACK 0.001673 **
## perp_raceBLACK HISPANIC:vic_raceBLACK 0.006369 **
## perp_raceWHITE HISPANIC:vic_raceBLACK 0.012274 *
## perp_raceBLACK:vic_raceBLACK HISPANIC 0.026479 *
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.022901 *
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.089993 .
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.048556 *
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.036219 *
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.174452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12203 on 11097 degrees of freedom
## Residual deviance: 11806 on 11069 degrees of freedom
## AIC: 11864
##
## Number of Fisher Scoring iterations: 4
Let’s plot the effects for all the interactionless variables:
a <- plot( effect("day_period", glm.sign.inter),rescale.axis=FALSE, ylab="Probability of murder")
b <- plot( effect("perp_age", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")
c <-plot( effect("vic_age", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")
d <-plot( effect("other_victims", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")
e <-plot( effect("jurisdiction", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")
grid.arrange(a,b,c,d,e , top=textGrob("Interactionless variables effect plots",gp=gpar(fontsize=20,font=3)))
As we can see:
during the early hours of the day it is less likely for the victim to survive a shooting incident.
the more the perpetrator is old, the less likely is for the victim to survive a shooting incident.
the more the victim is old, the less likely is for him/her to survive a shooting incident.
as the number of other victims increases, the less likely is for the victim to survive a shooting incident.
shootings incident with patrol jurisdiction have a higher probability of survival.
plot( effect("perp_race:vic_race", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")
As we can see if the victim is “Black Hispanic”, “White Hispanic” or “Black” the probability of survival doesn’t change much; while if the victim is “ASIAN” or “WHITE”, the probability of survival decreases if the shooter is also “ASIAN” or “WHITE.
Let’s compare all the models done so far in terms of prediction power to be able to compare them with other types of models:
glm.full.pred <- predict(glm.full, newdata = shootings.test, type = "response")
glm.sig.pred <- predict(glm.sig, newdata = shootings.test, type = "response")
glm.full.inter.pred <- predict(glm.full.inter, newdata = shootings.test, type = "response")
glm.sig.inter.pred <- predict(glm.sign.inter, newdata = shootings.test, type = "response")
Select the best threshold via ROC curve:
par(mfrow=c(2,2))
glm.full.roc <- roc(shootings.test$murder_prob ~ glm.full.pred, plot=TRUE, print.auc=TRUE, main="glm.full ROC curve")
glm.sig.roc <- roc(shootings.test$murder_prob ~ glm.sig.pred, plot=TRUE, print.auc=TRUE, main="glm.sig ROC curve")
glm.full.inter.roc <- roc(shootings.test$murder_prob ~ glm.full.inter.pred, plot=TRUE, print.auc=TRUE, main="glm.full.inter ROC curve")
glm.sig.inter.roc <- roc(shootings.test$murder_prob ~ glm.sig.inter.pred, plot=TRUE, print.auc=TRUE, main="glm.sign.inter ROC curve")
As we can see, the ROC curve is essentially the same for all the models.
glm.full.metrics <- coords(glm.full.roc, x="best", ret="all")
glm.sig.metrics <- coords(glm.sig.roc, x="best", ret="all")
glm.full.inter.metrics <- coords(glm.full.inter.roc, x="best", ret="all")
glm.sig.inter.metrics <- coords(glm.sig.inter.roc, x="best", ret="all")
row.names(glm.full.metrics) <- "Full model"
row.names(glm.sig.metrics) <- "Significant predictors model"
row.names(glm.full.inter.metrics) <- "Full model with interaction"
row.names(glm.sig.inter.metrics) <- "Significant predictors model with interaction"
metrics <- rbind(glm.full.metrics, glm.sig.metrics, glm.full.inter.metrics, glm.sig.inter.metrics)
Now let’s compare:
(metrics %>% arrange(desc(specificity)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Full model 0.5288553 0.6414523 0.5556757
## Significant predictors model 0.5056764 0.6611195 0.5427027
The best model in terms of specificity is “Significant predictors model with interaction”.
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Significant predictors model 0.5056764 0.6611195 0.5427027
## Full model 0.5288553 0.6414523 0.5556757
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
The best model in terms of sensitivity is “Significant predictors model”.
(metrics %>% arrange(desc(accuracy)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Full model 0.5288553 0.6414523 0.5556757
## Significant predictors model 0.5056764 0.6611195 0.5427027
The best model in terms of accuracy is “Significant predictors model with interaction”.
Since the response in unbalanced:
print(dfSummary(shootings.test[,"murder_prob"]), method="render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | murder_prob [numeric] |
|
|
2775 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19
The best model in terms of prediction power should maximize its ability to correctly classify murders. For this reason i should choose the model with higher sensitivity and acceptable specificity (>0.5): “Full model”.
confusion_matrix <- function(pred_prob, threshold){
pred_class <- pred_prob > threshold
table(preds=pred_class, true=as.logical(shootings.test$murder_prob))
}
glm.full.metrics[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Full model 0.5288553 0.6414523 0.5556757
confusion_matrix(glm.full.pred, glm.full.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1118 237
## TRUE 996 424
glm.full.inter.metrics[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Full model with interaction 0.5397351 0.6323752 0.5618018
confusion_matrix(glm.full.inter.pred, glm.full.inter.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1141 243
## TRUE 973 418
glm.sig.metrics[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Significant predictors model 0.5056764 0.6611195 0.5427027
confusion_matrix(glm.sig.pred, glm.sig.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1069 224
## TRUE 1045 437
glm.sig.inter.metrics[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
confusion_matrix(glm.sig.inter.pred, glm.sig.inter.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1146 243
## TRUE 968 418
Now let’s apply automatic model selection to identity the best model.
We use the model which has the most predictors and interaction terms as starting point.
model <- glm(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, family = binomial, data = shootings.train)
glm.step <- step(model, direction = "both", trace = FALSE)
summary(glm.step)
##
## Call:
## glm(formula = murder_prob ~ day_period + year + Latitude + perp_age +
## vic_age + perp_race + vic_race + other_victims + jurisdiction,
## family = binomial, data = shootings.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.787131 13.616581 -0.719 0.472285
## day_periodMorning -0.246401 0.123987 -1.987 0.046887 *
## day_periodEarlyAfternoon -0.397900 0.113502 -3.506 0.000455 ***
## day_periodAfternoon -0.436626 0.101885 -4.285 1.82e-05 ***
## day_periodEvening -0.394104 0.101080 -3.899 9.66e-05 ***
## day_periodNight -0.408608 0.095415 -4.282 1.85e-05 ***
## year -0.006386 0.004422 -1.444 0.148655
## Latitude 0.534628 0.267956 1.995 0.046020 *
## perp_age18-24 0.132106 0.083252 1.587 0.112554
## perp_age25-44 0.410781 0.085078 4.828 1.38e-06 ***
## perp_age45+ 0.685533 0.126370 5.425 5.80e-08 ***
## vic_age18-24 0.274380 0.086357 3.177 0.001487 **
## vic_age25-44 0.339776 0.086189 3.942 8.07e-05 ***
## vic_age45+ 0.307450 0.111640 2.754 0.005888 **
## perp_raceBLACK -0.531496 0.132681 -4.006 6.18e-05 ***
## perp_raceBLACK HISPANIC -0.640503 0.152269 -4.206 2.59e-05 ***
## perp_raceWHITE HISPANIC -0.487890 0.140572 -3.471 0.000519 ***
## vic_raceBLACK -0.074958 0.109876 -0.682 0.495111
## vic_raceBLACK HISPANIC -0.339954 0.129735 -2.620 0.008783 **
## vic_raceWHITE HISPANIC -0.096896 0.118128 -0.820 0.412069
## other_victims 0.134627 0.009818 13.712 < 2e-16 ***
## jurisdictionHOUSING -0.205217 0.065771 -3.120 0.001808 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12203 on 11097 degrees of freedom
## Residual deviance: 11817 on 11076 degrees of freedom
## AIC: 11861
##
## Number of Fisher Scoring iterations: 4
AIC(glm.full, glm.sig, glm.full.inter, glm.sign.inter, glm.step) %>% arrange(AIC)
## df AIC
## glm.step 22 11861.36
## glm.sig 20 11863.08
## glm.sign.inter 29 11864.46
## glm.full 35 11877.18
## glm.full.inter 44 11878.91
In terms of AIC the “Step model” is slightly better.
Let’s compare the step model with the models done so far.
glm.step.pred <- predict(glm.step, newdata = shootings.test, type = "response")
glm.step.roc <- roc(shootings.test$murder_prob ~ glm.step.pred, plot=TRUE, print.auc=TRUE, main="Step model ROC curve")
glm.step.metrics <- coords(glm.step.roc, x="best", ret="all")
rownames(glm.step.metrics) <- "Step model"
metrics <- rbind(metrics, glm.step.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Step model 0.5132450 0.6641452 0.5491892
## Significant predictors model 0.5056764 0.6611195 0.5427027
## Full model 0.5288553 0.6414523 0.5556757
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
The “Step model” is slightly better in terms of sensitivity compared to the best overall model so far (“Full model”) and has an acceptable specificity: it is the new best overall model.
Confusion matrix for step model:
confusion_matrix(glm.step.pred, glm.step.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1085 222
## TRUE 1029 439
Now we try with lasso regression. As before we use as starting point the formula with all predictors and interactions term.
lasso.mod <- glmnet(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, data=shootings.train, alpha=1, family = "binomial")
plot(lasso.mod, main='Path plot of the Lasso estimates\n\n')
We choose lambda using cross validation:
lasso.cv <- cv.glmnet(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, data=shootings.train, alpha=1, family = "binomial")
plot(lasso.cv)
Lasso gives us the choice between two values of lambda:
lambda.min: lambda of minimum mean cross-validated
error.
lambda.1se: largest value of lambda such that error
is within 1 standard error of the cross-validated errors for
lambda.min.
We explore both paths.
plot(lasso.mod, xvar = "lambda", main='Path plot of the Lasso estimates with 1se lambda\n\n')
abline(v = log(lasso.cv$lambda.1se), lty="dashed")
The remaining coefficients are:
lasso.1se.coef <- coef(lasso.mod, s=lasso.cv$lambda.1se)
lasso.1se.coef[lasso.1se.coef[,1]!=0,]
## (Intercept)
## -1.33784349
## day_periodEarlyMorning
## 0.07119012
## perp_age25-44
## 0.10728935
## perp_age45+
## 0.16398864
## vic_age<18
## -0.02929429
## other_victims
## 0.09101163
## perp_raceASIAN/WHITE:vic_raceASIAN/WHITE
## 0.43398945
plot(lasso.mod, xvar = "lambda", main='Path plot of the Lasso estimates with min lambda\n\n')
abline(v =log(lasso.cv$lambda.min), lty="dashed")
The remaining coefficients are:
lasso.min.coef <- coef(lasso.mod, s=lasso.cv$lambda.min)
lasso.min.coef[lasso.min.coef[,1]!=0,]
## (Intercept)
## -7.425535e+00
## day_periodEarlyMorning
## 3.345140e-01
## day_periodMorning
## 5.401413e-02
## day_periodAfternoon
## -1.020057e-03
## week_dayWednesday
## 3.365738e-02
## week_dayFriday
## 1.617189e-04
## week_daySaturday
## -4.362882e-02
## Latitude
## 1.437363e-01
## city_locationN_Manhattan
## -7.263793e-02
## city_locationW_Bronx
## 3.824675e-02
## city_locationW_Staten_Island
## -1.489629e-01
## perp_age<18
## -4.102680e-02
## perp_age25-44
## 2.432676e-01
## perp_age45+
## 3.707116e-01
## vic_age<18
## -2.256458e-01
## vic_age25-44
## 3.835733e-02
## perp_raceBLACK HISPANIC
## -3.554706e-02
## vic_raceBLACK HISPANIC
## -1.615043e-01
## other_victims
## 1.259900e-01
## jurisdictionPATROL
## 1.382466e-01
## jurisdictionHOUSING
## -2.125561e-13
## perp_raceASIAN/WHITE:vic_raceASIAN/WHITE
## 8.056032e-01
## perp_raceBLACK HISPANIC:vic_raceASIAN/WHITE
## -1.512426e-01
## perp_raceASIAN/WHITE:vic_raceWHITE HISPANIC
## 9.485748e-02
## perp_age<18:vic_age<18
## -4.363924e-02
## perp_age<18:vic_age18-24
## -1.007613e-01
## perp_age18-24:vic_age45+
## -1.783076e-03
## perp_age45+:vic_age45+
## 2.558898e-01
Now let’s compare prediction performance of lasso models.
lasso.1se.pred <- predict(lasso.mod, s=lasso.cv$lambda.1se, newdata = shootings.test, type = "response")
lasso.min.pred <- predict(lasso.mod, s=lasso.cv$lambda.min, newdata = shootings.test, type = "response")
par(mfrow=c(1,2))
lasso.1se.roc <- roc(shootings.test$murder_prob ~ lasso.1se.pred, plot=TRUE, print.auc=TRUE, main="Lasso 1se lambda ROC curve")
lasso.min.roc <- roc(shootings.test$murder_prob ~ lasso.min.pred, plot=TRUE, print.auc=TRUE, main="Lasso min lambda ROC curve")
According to the ROC curve the Lasso model with min lambda is slightly
better than the one with 1se lambda.
lasso.1se.metrics <- coords(lasso.1se.roc, x="best", ret="all")
row.names(lasso.1se.metrics) <- "Lasso 1se lambda"
lasso.min.metrics <- coords(lasso.min.roc, x="best", ret="all")
row.names(lasso.min.metrics) <- "Lasso min lambda"
metrics <- rbind(metrics, lasso.1se.metrics, lasso.min.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Lasso 1se lambda 0.4550615 0.7049924 0.5145946
## Step model 0.5132450 0.6641452 0.5491892
## Significant predictors model 0.5056764 0.6611195 0.5427027
## Full model 0.5288553 0.6414523 0.5556757
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
## Lasso min lambda 0.6636708 0.5158850 0.6284685
As we can see “Lasso 1se lambda” is the best model in terms of sensitivity now. Unfortunately it is also the worst model in terms of accuracy and specificity. The model “Lasso min lambda” is the worst model in terms of sensitivity but has the higher specificity and accuracy.
Confusion Matrix for lambda.1se:
confusion_matrix(lasso.1se.pred, lasso.1se.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 962 195
## TRUE 1152 466
Confusion Matrix for lambda.min:
confusion_matrix(lasso.min.pred, lasso.min.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1403 320
## TRUE 711 341
Now we try with ridge regression. As before we use as starting point the formula with all predictors and interaction term.
ridge.mod <- glmnet(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, data=shootings.train, alpha=0, family = binomial)
plot(ridge.mod, main='Path plot of the Ridge estimates\n\n')
We choose lambda using cross validation:
ridge.cv <- cv.glmnet(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, data=shootings.train, alpha=0, family = binomial)
plot(ridge.cv)
Ridge (as Lasso) gives us the choice between two values of lambda:
lambda.min: lambda of minimum mean cross-validated
error.
lambda.1se: largest value of lambda such that error
is within 1 standard error of the cross-validated errors for
lambda.min.
We explore both paths.
plot(ridge.mod, xvar = "lambda", main='Path plot of the Ridge estimates with 1se lambda\n\n')
abline(v =log(ridge.cv$lambda.1se), lty="dashed")
ridge.1se.coef <- coef(ridge.mod, s=ridge.cv$lambda.1se)
ridge.1se.coef
## 96 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 3.845321e+00
## day_periodEarlyMorning 1.388384e-01
## day_periodMorning 3.401170e-02
## day_periodEarlyAfternoon -1.462631e-02
## day_periodAfternoon -2.557651e-02
## day_periodEvening -8.282254e-03
## day_periodNight -1.052521e-02
## day_year 3.089940e-05
## year -1.302723e-03
## week_dayMonday 5.852612e-03
## week_dayTuesday -1.953402e-02
## week_dayWednesday 2.564763e-02
## week_dayThursday -4.587178e-04
## week_dayFriday 1.675245e-02
## week_daySaturday -2.601010e-02
## week_daySunday 4.776716e-03
## COVID_lockdownFALSE 4.110477e-03
## COVID_lockdownTRUE -4.109947e-03
## COVID_pandemicFALSE -3.101527e-03
## COVID_pandemicTRUE 3.101402e-03
## working_hourFALSE 7.875071e-03
## working_hourTRUE -7.875044e-03
## Latitude 1.139847e-01
## Longitude 9.593571e-02
## city_locationS_Manhattan 4.992745e-03
## city_locationN_Manhattan -4.756823e-02
## city_locationW_Bronx 3.146970e-02
## city_locationE_Bronx 1.558811e-02
## city_locationN_E_Queens -1.826869e-02
## city_locationN_W_Queens 2.009564e-03
## city_locationS_Queens 1.749562e-02
## city_locationN_Brooklyn -9.227367e-03
## city_locationS_E_Brooklyn -8.005881e-03
## city_locationS_W_Brooklyn 9.330641e-03
## city_locationW_Staten_Island -1.011926e-01
## city_locationE_Staten_Island -8.709403e-03
## perp_age<18 -7.144389e-02
## perp_age18-24 -4.581336e-02
## perp_age25-44 5.406089e-02
## perp_age45+ 1.267833e-01
## vic_age<18 -7.397795e-02
## vic_age18-24 -1.240023e-02
## vic_age25-44 3.036392e-02
## vic_age45+ 3.391347e-02
## perp_sexF 1.601497e-02
## perp_sexM -1.601363e-02
## vic_sexF 1.747560e-02
## vic_sexM -1.747380e-02
## perp_raceASIAN/WHITE 1.489891e-01
## perp_raceBLACK -1.471085e-02
## perp_raceBLACK HISPANIC -3.379992e-02
## perp_raceWHITE HISPANIC 9.781474e-03
## vic_raceASIAN/WHITE 6.193949e-02
## vic_raceBLACK -9.274449e-04
## vic_raceBLACK HISPANIC -5.675617e-02
## vic_raceWHITE HISPANIC 1.757999e-02
## other_victims 4.397806e-02
## jurisdictionPATROL 5.149342e-02
## jurisdictionHOUSING -5.149318e-02
## perp_raceASIAN/WHITE:vic_raceASIAN/WHITE 2.639927e-01
## perp_raceBLACK:vic_raceASIAN/WHITE -4.481049e-02
## perp_raceBLACK HISPANIC:vic_raceASIAN/WHITE -1.923715e-01
## perp_raceWHITE HISPANIC:vic_raceASIAN/WHITE 2.529882e-02
## perp_raceASIAN/WHITE:vic_raceBLACK -6.532200e-02
## perp_raceBLACK:vic_raceBLACK -1.499015e-03
## perp_raceBLACK HISPANIC:vic_raceBLACK -1.406054e-02
## perp_raceWHITE HISPANIC:vic_raceBLACK 2.125830e-02
## perp_raceASIAN/WHITE:vic_raceBLACK HISPANIC -1.213886e-01
## perp_raceBLACK:vic_raceBLACK HISPANIC -4.573116e-02
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC -7.383989e-02
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC -4.224034e-02
## perp_raceASIAN/WHITE:vic_raceWHITE HISPANIC 1.383977e-01
## perp_raceBLACK:vic_raceWHITE HISPANIC 8.594714e-03
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 2.610457e-03
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 1.822383e-02
## perp_age<18:vic_age<18 -1.095491e-01
## perp_age18-24:vic_age<18 -6.119831e-02
## perp_age25-44:vic_age<18 -1.942468e-02
## perp_age45+:vic_age<18 2.296742e-02
## perp_age<18:vic_age18-24 -7.664121e-02
## perp_age18-24:vic_age18-24 -2.897207e-02
## perp_age25-44:vic_age18-24 4.631818e-02
## perp_age45+:vic_age18-24 1.300625e-01
## perp_age<18:vic_age25-44 8.020179e-03
## perp_age18-24:vic_age25-44 -1.533855e-02
## perp_age25-44:vic_age25-44 4.219886e-02
## perp_age45+:vic_age25-44 7.499659e-02
## perp_age<18:vic_age45+ -7.101450e-02
## perp_age18-24:vic_age45+ -5.759527e-02
## perp_age25-44:vic_age45+ 3.684330e-02
## perp_age45+:vic_age45+ 2.020744e-01
## perp_sexF:vic_sexF -1.042000e-01
## perp_sexM:vic_sexF 2.385901e-02
## perp_sexF:vic_sexM 4.221193e-02
## perp_sexM:vic_sexM -2.328960e-02
## year:day_year 1.495699e-08
Differently from Lasso, Ridge doesn’t force to zero the coefficient thus is difficult to interpret those numbers.
plot(ridge.mod, xvar = "lambda", main='Path plot of the Ridge estimates with min lambda\n\n')
abline(v =log(ridge.cv$lambda.min), lty="dashed")
ridge.min.coef <- coef(ridge.mod, s=ridge.cv$lambda.min)
ridge.min.coef
## 96 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -5.901296e+00
## day_periodEarlyMorning 3.135773e-01
## day_periodMorning 9.849081e-02
## day_periodEarlyAfternoon -1.861883e-02
## day_periodAfternoon -5.443512e-02
## day_periodEvening -3.104734e-02
## day_periodNight -2.827166e-02
## day_year 5.775967e-05
## year -4.731804e-03
## week_dayMonday 1.188938e-02
## week_dayTuesday -2.966191e-02
## week_dayWednesday 8.232225e-02
## week_dayThursday 8.206412e-04
## week_dayFriday 5.329583e-02
## week_daySaturday -8.089602e-02
## week_daySunday -5.792982e-03
## COVID_lockdownFALSE 2.526777e-02
## COVID_lockdownTRUE -2.492615e-02
## COVID_pandemicFALSE -1.661065e-02
## COVID_pandemicTRUE 1.638272e-02
## working_hourFALSE 1.475393e-02
## working_hourTRUE -1.468342e-02
## Latitude 3.766273e-01
## Longitude 1.791574e-02
## city_locationS_Manhattan 9.093828e-04
## city_locationN_Manhattan -1.156430e-01
## city_locationW_Bronx 6.121018e-02
## city_locationE_Bronx 5.480075e-03
## city_locationN_E_Queens -3.173780e-02
## city_locationN_W_Queens -2.218201e-03
## city_locationS_Queens 5.199380e-02
## city_locationN_Brooklyn -3.284680e-03
## city_locationS_E_Brooklyn 8.959906e-05
## city_locationS_W_Brooklyn 2.523190e-02
## city_locationW_Staten_Island -3.156249e-01
## city_locationE_Staten_Island -2.178454e-02
## perp_age<18 -1.294470e-01
## perp_age18-24 -6.792880e-02
## perp_age25-44 8.816658e-02
## perp_age45+ 1.987393e-01
## vic_age<18 -1.414013e-01
## vic_age18-24 -9.364481e-03
## vic_age25-44 5.350341e-02
## vic_age45+ 3.656073e-02
## perp_sexF 1.128299e-02
## perp_sexM -1.146672e-02
## vic_sexF 6.355825e-03
## vic_sexM -6.758703e-03
## perp_raceASIAN/WHITE 2.288386e-01
## perp_raceBLACK -1.239356e-02
## perp_raceBLACK HISPANIC -6.732929e-02
## perp_raceWHITE HISPANIC 9.041020e-03
## vic_raceASIAN/WHITE 8.034759e-02
## vic_raceBLACK 1.497959e-02
## vic_raceBLACK HISPANIC -1.031781e-01
## vic_raceWHITE HISPANIC 1.735639e-02
## other_victims 1.092331e-01
## jurisdictionPATROL 8.883963e-02
## jurisdictionHOUSING -8.866483e-02
## perp_raceASIAN/WHITE:vic_raceASIAN/WHITE 5.127335e-01
## perp_raceBLACK:vic_raceASIAN/WHITE -1.356727e-01
## perp_raceBLACK HISPANIC:vic_raceASIAN/WHITE -5.444265e-01
## perp_raceWHITE HISPANIC:vic_raceASIAN/WHITE 9.018283e-03
## perp_raceASIAN/WHITE:vic_raceBLACK -2.496443e-01
## perp_raceBLACK:vic_raceBLACK 1.357133e-02
## perp_raceBLACK HISPANIC:vic_raceBLACK -1.954631e-02
## perp_raceWHITE HISPANIC:vic_raceBLACK 4.459676e-02
## perp_raceASIAN/WHITE:vic_raceBLACK HISPANIC -3.501667e-01
## perp_raceBLACK:vic_raceBLACK HISPANIC -6.776895e-02
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC -1.416040e-01
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC -8.677638e-02
## perp_raceASIAN/WHITE:vic_raceWHITE HISPANIC 1.152705e-01
## perp_raceBLACK:vic_raceWHITE HISPANIC 5.360135e-03
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 9.518035e-03
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 1.977844e-02
## perp_age<18:vic_age<18 -2.111777e-01
## perp_age18-24:vic_age<18 -1.019182e-01
## perp_age25-44:vic_age<18 -7.585244e-02
## perp_age45+:vic_age<18 5.114552e-02
## perp_age<18:vic_age18-24 -1.618143e-01
## perp_age18-24:vic_age18-24 -2.861700e-02
## perp_age25-44:vic_age18-24 8.467749e-02
## perp_age45+:vic_age18-24 2.576952e-01
## perp_age<18:vic_age25-44 7.116809e-02
## perp_age18-24:vic_age25-44 -2.026125e-02
## perp_age25-44:vic_age25-44 6.836001e-02
## perp_age45+:vic_age25-44 6.925437e-02
## perp_age<18:vic_age45+ -1.466736e-01
## perp_age18-24:vic_age45+ -1.658063e-01
## perp_age25-44:vic_age45+ 5.742714e-02
## perp_age45+:vic_age45+ 3.677524e-01
## perp_sexF:vic_sexF -2.954253e-01
## perp_sexM:vic_sexF 2.239350e-02
## perp_sexF:vic_sexM 7.833969e-02
## perp_sexM:vic_sexM -2.075992e-02
## year:day_year 2.648577e-08
ridge.1se.pred <- predict(ridge.mod, s=ridge.cv$lambda.1se, newdata = shootings.test, type = "response")
ridge.min.pred <- predict(ridge.mod, s=ridge.cv$lambda.min, newdata = shootings.test, type = "response")
par(mfrow=c(1,2))
ridge.1se.roc <- roc(shootings.test$murder_prob ~ ridge.1se.pred, plot=TRUE, print.auc=TRUE, main="1se lambda Ridge ROC curve")
ridge.min.roc <- roc(shootings.test$murder_prob ~ ridge.min.pred, plot=TRUE, print.auc=TRUE, main="min lambda Ridge ROC curve")
The ROC curves are almost identical.
ridge.1se.metrics <- coords(ridge.1se.roc, x="best", ret="all")
row.names(ridge.1se.metrics) <- "ridge 1se lambda"
ridge.min.metrics <- coords(ridge.min.roc, x="best", ret="all")
row.names(ridge.min.metrics) <- "ridge min lambda"
metrics <- rbind(metrics, ridge.1se.metrics, ridge.min.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Lasso 1se lambda 0.4550615 0.7049924 0.5145946
## Step model 0.5132450 0.6641452 0.5491892
## Significant predictors model 0.5056764 0.6611195 0.5427027
## Full model 0.5288553 0.6414523 0.5556757
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
## ridge 1se lambda 0.5501419 0.6157337 0.5657658
## Lasso min lambda 0.6636708 0.5158850 0.6284685
## ridge min lambda 0.7270577 0.4372163 0.6580180
“Ridge min lambda” model is the worse in terms of sensitivity; while “Ridge 1se lambda” is below average in terms of sensitivity.
Confusion Matrix for lambda.1se:
confusion_matrix(ridge.1se.pred, ridge.1se.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1163 254
## TRUE 951 407
Confusion Matrix for lambda.min:
confusion_matrix(ridge.min.pred, ridge.min.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1537 372
## TRUE 577 289
nb.fit <- naiveBayes(formula(glm.full), data = shootings.train)
nb.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.7611281 0.2388719
##
## Conditional probabilities:
## day_period
## Y EarlyMorning Morning EarlyAfternoon Afternoon Evening Night
## 0 0.04901148 0.06037647 0.10465254 0.20883154 0.20670060 0.37042737
## 1 0.07808374 0.06752169 0.09807620 0.19011694 0.20294229 0.36325915
##
## day_year
## Y [,1] [,2]
## 0 187.0831 96.53237
## 1 188.4889 98.96107
##
## year
## Y [,1] [,2]
## 0 2013.540 5.193262
## 1 2013.513 5.337454
##
## week_day
## Y Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## 0 0.1369717 0.1226471 0.1156624 0.1186220 0.1340121 0.1875222 0.1845626
## 1 0.1395700 0.1124104 0.1233497 0.1184459 0.1395700 0.1772916 0.1893625
##
## COVID_lockdown
## Y FALSE TRUE
## 0 0.96247188 0.03752812
## 1 0.96114674 0.03885326
##
## COVID_pandemic
## Y FALSE TRUE
## 0 0.8024151 0.1975849
## 1 0.7963033 0.2036967
##
## working_hour
## Y FALSE TRUE
## 0 0.8103469 0.1896531
## 1 0.8193135 0.1806865
##
## Latitude
## Y [,1] [,2]
## 0 40.74204 0.08895157
## 1 40.74534 0.08970732
##
## Longitude
## Y [,1] [,2]
## 0 -73.91299 0.06986026
## 1 -73.91067 0.07050189
##
## perp_age
## Y <18 18-24 25-44 45+
## 0 0.12241032 0.45720374 0.38179235 0.03859358
## 1 0.08675971 0.38966428 0.45492267 0.06865334
##
## vic_age
## Y <18 18-24 25-44 45+
## 0 0.11956908 0.35622114 0.44512845 0.07908133
## 1 0.08411920 0.33006413 0.48811769 0.09769898
##
## perp_sex
## Y F M
## 0 0.02935954 0.97064046
## 1 0.03432667 0.96567333
##
## vic_sex
## Y F M
## 0 0.1124660 0.8875340
## 1 0.1316484 0.8683516
##
## perp_race
## Y ASIAN/WHITE BLACK BLACK HISPANIC WHITE HISPANIC
## 0 0.02391382 0.73339647 0.08985439 0.15283533
## 1 0.04903810 0.70803470 0.07883817 0.16408902
##
## vic_race
## Y ASIAN/WHITE BLACK BLACK HISPANIC WHITE HISPANIC
## 0 0.04545993 0.67704510 0.10915118 0.16834379
## 1 0.06601283 0.66088269 0.08600528 0.18709921
##
## other_victims
## Y [,1] [,2]
## 0 1.048656 1.904262
## 1 1.719351 2.632683
##
## jurisdiction
## Y PATROL HOUSING
## 0 0.8318930 0.1681070
## 1 0.8675971 0.1324029
nb.pred <- predict(nb.fit, newdata = shootings.test, type = "raw")
nb.roc <- roc(shootings.test$murder_prob ~ nb.pred[,2], plot=TRUE, print.auc=TRUE, main="Naive Bayes ROC curve")
nb.metrics <- coords(nb.roc, x="best", ret="all")
row.names(nb.metrics) <- "Naive bayes"
metrics <- rbind(metrics, nb.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Lasso 1se lambda 0.4550615 0.7049924 0.5145946
## Step model 0.5132450 0.6641452 0.5491892
## Significant predictors model 0.5056764 0.6611195 0.5427027
## Naive bayes 0.5444655 0.6444781 0.5682883
## Full model 0.5288553 0.6414523 0.5556757
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
## ridge 1se lambda 0.5501419 0.6157337 0.5657658
## Lasso min lambda 0.6636708 0.5158850 0.6284685
## ridge min lambda 0.7270577 0.4372163 0.6580180
Naive Bayes has high sensitivity but not the best one and also has a pretty okey specificity.
Confusion matrix for Naive Bayes:
confusion_matrix(nb.pred[,2], nb.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1151 235
## TRUE 963 426
Now let’s fit a GAM model with all predictors (interaction term included). I applied splines to all the numerical predictors.
f <- update(formula(glm.full.inter), . ~ . + perp_age:vic_age + perp_sex:vic_sex -year -day_year -Latitude -Longitude + s(year) + s(day_year, bs="cc") + s(Latitude) + s(Longitude) -other_victims + s(other_victims))
gam.fit <- gam(f, data=shootings.train, family=binomial)
summary(gam.fit)
##
## Family: binomial
## Link function: logit
##
## Formula:
## murder_prob ~ day_period + week_day + COVID_lockdown + COVID_pandemic +
## working_hour + perp_age + vic_age + perp_sex + vic_sex +
## perp_race + vic_race + jurisdiction + s(year) + s(day_year,
## bs = "cc") + s(Latitude) + s(Longitude) + s(other_victims) +
## perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex
##
## Parametric coefficients:
## Estimate Std. Error z value
## (Intercept) -0.728455 0.410341 -1.775
## day_periodMorning -0.226595 0.140650 -1.611
## day_periodEarlyAfternoon -0.350309 0.130967 -2.675
## day_periodAfternoon -0.422978 0.107230 -3.945
## day_periodEvening -0.384590 0.104242 -3.689
## day_periodNight -0.370610 0.097612 -3.797
## week_dayTuesday -0.045968 0.091956 -0.500
## week_dayWednesday 0.116609 0.090848 1.284
## week_dayThursday 0.006966 0.091375 0.076
## week_dayFriday 0.062237 0.088047 0.707
## week_daySaturday -0.112104 0.086299 -1.299
## week_daySunday -0.015481 0.085427 -0.181
## COVID_lockdownTRUE 0.056426 0.175569 0.321
## COVID_pandemicTRUE -0.306546 0.236900 -1.294
## working_hourTRUE -0.069236 0.091084 -0.760
## perp_age18-24 0.224478 0.179585 1.250
## perp_age25-44 0.395946 0.223136 1.774
## perp_age45+ 0.692933 0.613908 1.129
## vic_age18-24 0.213367 0.188010 1.135
## vic_age25-44 0.628756 0.199414 3.153
## vic_age45+ 0.274866 0.358418 0.767
## perp_sexM 0.382277 0.337087 1.134
## vic_sexM 0.564608 0.358106 1.577
## perp_raceBLACK -1.066301 0.221670 -4.810
## perp_raceBLACK HISPANIC -1.668204 0.511983 -3.258
## perp_raceWHITE HISPANIC -0.882037 0.276411 -3.191
## vic_raceBLACK -0.984463 0.330294 -2.981
## vic_raceBLACK HISPANIC -1.295326 0.529115 -2.448
## vic_raceWHITE HISPANIC -0.624343 0.325836 -1.916
## jurisdictionHOUSING -0.199834 0.067017 -2.982
## perp_raceBLACK:vic_raceBLACK 1.103346 0.367819 3.000
## perp_raceBLACK HISPANIC:vic_raceBLACK 1.592625 0.601762 2.647
## perp_raceWHITE HISPANIC:vic_raceBLACK 1.022564 0.413815 2.471
## perp_raceBLACK:vic_raceBLACK HISPANIC 1.233328 0.562835 2.191
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 1.657084 0.738430 2.244
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 1.059277 0.594133 1.783
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.759589 0.370246 2.052
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 1.261960 0.603553 2.091
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.569182 0.406294 1.401
## perp_age18-24:vic_age18-24 0.020166 0.222845 0.090
## perp_age25-44:vic_age18-24 0.116118 0.262423 0.442
## perp_age45+:vic_age18-24 0.192137 0.690454 0.278
## perp_age18-24:vic_age25-44 -0.327887 0.233299 -1.405
## perp_age25-44:vic_age25-44 -0.256357 0.266305 -0.963
## perp_age45+:vic_age25-44 -0.467323 0.644026 -0.726
## perp_age18-24:vic_age45+ -0.124813 0.403515 -0.309
## perp_age25-44:vic_age45+ 0.094252 0.413197 0.228
## perp_age45+:vic_age45+ 0.202144 0.715398 0.283
## perp_sexM:vic_sexM -0.596309 0.365515 -1.631
## Pr(>|z|)
## (Intercept) 0.075857 .
## day_periodMorning 0.107167
## day_periodEarlyAfternoon 0.007478 **
## day_periodAfternoon 7.99e-05 ***
## day_periodEvening 0.000225 ***
## day_periodNight 0.000147 ***
## week_dayTuesday 0.617151
## week_dayWednesday 0.199293
## week_dayThursday 0.939232
## week_dayFriday 0.479655
## week_daySaturday 0.193939
## week_daySunday 0.856197
## COVID_lockdownTRUE 0.747917
## COVID_pandemicTRUE 0.195669
## working_hourTRUE 0.447177
## perp_age18-24 0.211304
## perp_age25-44 0.075987 .
## perp_age45+ 0.259014
## vic_age18-24 0.256430
## vic_age25-44 0.001616 **
## vic_age45+ 0.443149
## perp_sexM 0.256769
## vic_sexM 0.114876
## perp_raceBLACK 1.51e-06 ***
## perp_raceBLACK HISPANIC 0.001121 **
## perp_raceWHITE HISPANIC 0.001418 **
## vic_raceBLACK 0.002877 **
## vic_raceBLACK HISPANIC 0.014361 *
## vic_raceWHITE HISPANIC 0.055349 .
## jurisdictionHOUSING 0.002865 **
## perp_raceBLACK:vic_raceBLACK 0.002702 **
## perp_raceBLACK HISPANIC:vic_raceBLACK 0.008130 **
## perp_raceWHITE HISPANIC:vic_raceBLACK 0.013471 *
## perp_raceBLACK:vic_raceBLACK HISPANIC 0.028432 *
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.024828 *
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.074604 .
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.040210 *
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.036538 *
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.161240
## perp_age18-24:vic_age18-24 0.927897
## perp_age25-44:vic_age18-24 0.658138
## perp_age45+:vic_age18-24 0.780800
## perp_age18-24:vic_age25-44 0.159892
## perp_age25-44:vic_age25-44 0.335726
## perp_age45+:vic_age25-44 0.468067
## perp_age18-24:vic_age45+ 0.757083
## perp_age25-44:vic_age45+ 0.819565
## perp_age45+:vic_age45+ 0.777513
## perp_sexM:vic_sexM 0.102801
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(year) 7.403 8.348 22.560 0.00509 **
## s(day_year) 6.599 8.000 16.504 0.01192 *
## s(Latitude) 1.001 1.003 2.775 0.09614 .
## s(Longitude) 1.000 1.001 1.318 0.25105
## s(other_victims) 8.837 8.978 290.618 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0462 Deviance explained = 4.69%
## UBRE = 0.061295 Scale est. = 1 n = 11098
Plot of the splines effect:
par(mfrow=c(2,3))
for (i in 1:5){
plot(gam.fit, select=i, shade=TRUE, shade.col = "lightblue")
abline(h=0, lty="dashed")
}
As we can see both for the model summary and the plots the effect of
Latitude and Longitude seams linear;
year and day_year seams periodic and
other_victims is highly non linear.
Now let’s try fitting a GAM model now with only significant predictors:
gam2.fit <- update(gam.fit, . ~ . - s(Latitude) -s(Longitude) - week_day -COVID_lockdown -COVID_pandemic -working_hour -vic_sex -perp_sex - perp_age:vic_age - perp_sex:vic_sex)
summary(gam2.fit)
##
## Family: binomial
## Link function: logit
##
## Formula:
## murder_prob ~ day_period + perp_age + vic_age + perp_race + vic_race +
## jurisdiction + s(year) + s(day_year, bs = "cc") + s(other_victims) +
## perp_race:vic_race
##
## Parametric coefficients:
## Estimate Std. Error z value
## (Intercept) -0.41239 0.20423 -2.019
## day_periodMorning -0.26084 0.12591 -2.072
## day_periodEarlyAfternoon -0.39040 0.11492 -3.397
## day_periodAfternoon -0.42978 0.10344 -4.155
## day_periodEvening -0.36603 0.10273 -3.563
## day_periodNight -0.37110 0.09711 -3.821
## perp_age18-24 0.12683 0.08403 1.509
## perp_age25-44 0.38626 0.08598 4.492
## perp_age45+ 0.62042 0.12863 4.823
## vic_age18-24 0.26904 0.08705 3.091
## vic_age25-44 0.33974 0.08701 3.905
## vic_age45+ 0.32995 0.11303 2.919
## perp_raceBLACK -1.04910 0.22011 -4.766
## perp_raceBLACK HISPANIC -1.62322 0.51076 -3.178
## perp_raceWHITE HISPANIC -0.86320 0.27436 -3.146
## vic_raceBLACK -0.95882 0.32881 -2.916
## vic_raceBLACK HISPANIC -1.29250 0.52898 -2.443
## vic_raceWHITE HISPANIC -0.58978 0.32290 -1.827
## jurisdictionHOUSING -0.20106 0.06654 -3.022
## perp_raceBLACK:vic_raceBLACK 1.07409 0.36637 2.932
## perp_raceBLACK HISPANIC:vic_raceBLACK 1.55815 0.60051 2.595
## perp_raceWHITE HISPANIC:vic_raceBLACK 1.02387 0.41187 2.486
## perp_raceBLACK:vic_raceBLACK HISPANIC 1.24700 0.56264 2.216
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 1.67212 0.73796 2.266
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 1.06836 0.59351 1.800
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.74504 0.36809 2.024
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 1.23950 0.60167 2.060
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.55419 0.40375 1.373
## Pr(>|z|)
## (Intercept) 0.043456 *
## day_periodMorning 0.038298 *
## day_periodEarlyAfternoon 0.000681 ***
## day_periodAfternoon 3.25e-05 ***
## day_periodEvening 0.000367 ***
## day_periodNight 0.000133 ***
## perp_age18-24 0.131212
## perp_age25-44 7.04e-06 ***
## perp_age45+ 1.41e-06 ***
## vic_age18-24 0.001997 **
## vic_age25-44 9.44e-05 ***
## vic_age45+ 0.003512 **
## perp_raceBLACK 1.88e-06 ***
## perp_raceBLACK HISPANIC 0.001483 **
## perp_raceWHITE HISPANIC 0.001654 **
## vic_raceBLACK 0.003545 **
## vic_raceBLACK HISPANIC 0.014550 *
## vic_raceWHITE HISPANIC 0.067769 .
## jurisdictionHOUSING 0.002513 **
## perp_raceBLACK:vic_raceBLACK 0.003371 **
## perp_raceBLACK HISPANIC:vic_raceBLACK 0.009467 **
## perp_raceWHITE HISPANIC:vic_raceBLACK 0.012923 *
## perp_raceBLACK:vic_raceBLACK HISPANIC 0.026670 *
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.023459 *
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.071850 .
## perp_raceBLACK:vic_raceWHITE HISPANIC 0.042963 *
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.039390 *
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.169877
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(year) 6.943 8.011 19.0 0.0150 *
## s(day_year) 6.630 8.000 17.6 0.0078 **
## s(other_victims) 8.828 8.975 293.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0462 Deviance explained = 4.46%
## UBRE = 0.059628 Scale est. = 1 n = 11098
par(mfrow=c(1,3))
for (i in 1:5){
plot(gam2.fit, select=i, shade=TRUE, shade.col = "lightblue")
abline(h=0, lty="dashed")
}
The splines effects doesn’t change much compared to the previews
model.
gam.predict <- predict(gam.fit, newdata = shootings.test, type = "response")
gam2.predict <- predict(gam2.fit, newdata = shootings.test, type = "response")
par(mfrow=c(1,2))
gam.roc <- roc(shootings.test$murder_prob ~ gam.predict, plot=TRUE, print.auc=TRUE, main="GAM 1 ROC curve")
gam2.roc <- roc(shootings.test$murder_prob ~ gam2.predict, plot=TRUE, print.auc=TRUE, main="GAM 2 ROC curve")
The ROC curves are almost identical.
gam.metrics <- coords(gam.roc, x="best", ret="all")
gam2.metrics <- coords(gam2.roc, x="best", ret="all")
row.names(gam.metrics) <- "GAM1"
row.names(gam2.metrics) <- "GAM2"
metrics <- rbind(metrics, gam.metrics, gam2.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Lasso 1se lambda 0.4550615 0.7049924 0.5145946
## Step model 0.5132450 0.6641452 0.5491892
## Significant predictors model 0.5056764 0.6611195 0.5427027
## Naive bayes 0.5444655 0.6444781 0.5682883
## Full model 0.5288553 0.6414523 0.5556757
## GAM2 0.5742668 0.6384266 0.5895495
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
## ridge 1se lambda 0.5501419 0.6157337 0.5657658
## GAM1 0.5979186 0.6051437 0.5996396
## Lasso min lambda 0.6636708 0.5158850 0.6284685
## ridge min lambda 0.7270577 0.4372163 0.6580180
“GAM1” model is the worst in terms of sensitivity but gives the best results in accuracy and specificity; while “GAM 2” is average in terms of sensitivity but gives the second best accuracy.
Let’s find the best model for all the metrics:
(metrics %>% arrange(desc(specificity)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## ridge min lambda 0.7270577 0.4372163 0.6580180
## Lasso min lambda 0.6636708 0.5158850 0.6284685
## GAM1 0.5979186 0.6051437 0.5996396
## GAM2 0.5742668 0.6384266 0.5895495
## ridge 1se lambda 0.5501419 0.6157337 0.5657658
## Naive bayes 0.5444655 0.6444781 0.5682883
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Full model 0.5288553 0.6414523 0.5556757
## Step model 0.5132450 0.6641452 0.5491892
## Significant predictors model 0.5056764 0.6611195 0.5427027
## Lasso 1se lambda 0.4550615 0.7049924 0.5145946
The best model in terms of specificity is “Lasso min lambda”.
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Lasso 1se lambda 0.4550615 0.7049924 0.5145946
## Step model 0.5132450 0.6641452 0.5491892
## Significant predictors model 0.5056764 0.6611195 0.5427027
## Naive bayes 0.5444655 0.6444781 0.5682883
## Full model 0.5288553 0.6414523 0.5556757
## GAM2 0.5742668 0.6384266 0.5895495
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
## ridge 1se lambda 0.5501419 0.6157337 0.5657658
## GAM1 0.5979186 0.6051437 0.5996396
## Lasso min lambda 0.6636708 0.5158850 0.6284685
## ridge min lambda 0.7270577 0.4372163 0.6580180
The best model in terms of sensitivity is “Lasso 1se lambda”. Unfortunately it has a not acceptable specificity; thus we still prefer the “Step model”.
(metrics %>% arrange(desc(accuracy)))[, c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## ridge min lambda 0.7270577 0.4372163 0.6580180
## Lasso min lambda 0.6636708 0.5158850 0.6284685
## GAM1 0.5979186 0.6051437 0.5996396
## GAM2 0.5742668 0.6384266 0.5895495
## Naive bayes 0.5444655 0.6444781 0.5682883
## ridge 1se lambda 0.5501419 0.6157337 0.5657658
## Significant predictors model with interaction 0.5421003 0.6323752 0.5636036
## Full model with interaction 0.5397351 0.6323752 0.5618018
## Full model 0.5288553 0.6414523 0.5556757
## Step model 0.5132450 0.6641452 0.5491892
## Significant predictors model 0.5056764 0.6611195 0.5427027
## Lasso 1se lambda 0.4550615 0.7049924 0.5145946
The best model in terms of accuracy is “Lasso min lambda”.
Since the response in unbalanced:
print(dfSummary(shootings.test[,"murder_prob"]), method="render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | murder_prob [numeric] |
|
|
2775 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19
The best model in terms of prediction power should maximize its ability to correctly classify murders. For this reason i should choose the model with higher sensitivity and acceptable specificity (>0.5): “Step model”:
summary(glm.step)
##
## Call:
## glm(formula = murder_prob ~ day_period + year + Latitude + perp_age +
## vic_age + perp_race + vic_race + other_victims + jurisdiction,
## family = binomial, data = shootings.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.787131 13.616581 -0.719 0.472285
## day_periodMorning -0.246401 0.123987 -1.987 0.046887 *
## day_periodEarlyAfternoon -0.397900 0.113502 -3.506 0.000455 ***
## day_periodAfternoon -0.436626 0.101885 -4.285 1.82e-05 ***
## day_periodEvening -0.394104 0.101080 -3.899 9.66e-05 ***
## day_periodNight -0.408608 0.095415 -4.282 1.85e-05 ***
## year -0.006386 0.004422 -1.444 0.148655
## Latitude 0.534628 0.267956 1.995 0.046020 *
## perp_age18-24 0.132106 0.083252 1.587 0.112554
## perp_age25-44 0.410781 0.085078 4.828 1.38e-06 ***
## perp_age45+ 0.685533 0.126370 5.425 5.80e-08 ***
## vic_age18-24 0.274380 0.086357 3.177 0.001487 **
## vic_age25-44 0.339776 0.086189 3.942 8.07e-05 ***
## vic_age45+ 0.307450 0.111640 2.754 0.005888 **
## perp_raceBLACK -0.531496 0.132681 -4.006 6.18e-05 ***
## perp_raceBLACK HISPANIC -0.640503 0.152269 -4.206 2.59e-05 ***
## perp_raceWHITE HISPANIC -0.487890 0.140572 -3.471 0.000519 ***
## vic_raceBLACK -0.074958 0.109876 -0.682 0.495111
## vic_raceBLACK HISPANIC -0.339954 0.129735 -2.620 0.008783 **
## vic_raceWHITE HISPANIC -0.096896 0.118128 -0.820 0.412069
## other_victims 0.134627 0.009818 13.712 < 2e-16 ***
## jurisdictionHOUSING -0.205217 0.065771 -3.120 0.001808 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12203 on 11097 degrees of freedom
## Residual deviance: 11817 on 11076 degrees of freedom
## AIC: 11861
##
## Number of Fisher Scoring iterations: 4
metrics["Step model", c("specificity", "sensitivity", "accuracy")]
## specificity sensitivity accuracy
## Step model 0.513245 0.6641452 0.5491892
Confusion matrix:
confusion_matrix(glm.step.pred, glm.step.metrics$threshold)
## true
## preds FALSE TRUE
## FALSE 1085 222
## TRUE 1029 439
Now let’s answer the project proposal questions using the best overall model effects.
Let’s plot the effects for all the predictors:
a <- plot( effect("day_period", glm.step),rescale.axis=FALSE, ylab="Probability of murder")
b <- plot( effect("year", glm.step),rescale.axis=FALSE, ylab="Probability of murder")
c <- plot( effect("Latitude", glm.step),rescale.axis=FALSE, ylab="Probability of murder")
d <- plot( effect("perp_age", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
e <-plot( effect("vic_age", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
f <- plot( effect("perp_race", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
g <-plot( effect("vic_race", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
h <-plot( effect("other_victims", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
i <-plot( effect("jurisdiction", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
grid.arrange(a,b,c,d,e,f,g,h,i, top=textGrob("Predictors effect plots",gp=gpar(fontsize=20,font=3)))
The initial question were:
Question (1): is survival more likely in a specific neighborhood?
The model doesn’t give information about neighborhood as the associated predictor was removed after the multicollinearity analysis. Thus the probability of murder across neighborhood is the same according to this model.
Question (2): is it more likely to survive based on gender/age of the victim? (obviously younger victims should be more likely to survive.)
The model doesn’t give information about victim sex as the associated predictor was removed by step function. Thus the probability of murder is the same for both male and female victims according to this model. Instead we can say something about the age of the victim: clearly there is substantial increase in murder probability when we increase the victim age from <18 to 18-24 years old but from 18-24 to 25-44 and from 25-44 to 45+ there is not a substantial increase. Also the confidence band tells us that the probability of murders from 18 to 45+ years is more or less the same.
Question (3): is it more likely to survive based on the victim’s ethnicity?
The model tells us that it is more likely for a Black Hispanic to survive compared to other races but with a very small effect. Also, watching the confidence bands the probability of murder if the victims are Asian or White or Black or White Hispanic is more or less the same.
Question (4): is it more likely to survive if the shooter belongs to a different/same ethnicity than the victim?
The model doesn’t give information about interaction between ethnicity as the associated predictor was removed by step function. Thus the probability of murder doesn’t change if the shooter belongs to a different/same ethnicity than the victim according to this model.
Question (5): is it more likely to survive based on sex/age/ethnicity of the perpetrator?
The model doesn’t give information about perpetrator sex as the associated predictor was removed by step function. Thus the probability of murder is the same as the perpetrator sex changes. Instead we can say something about perpetrator age and ethnicity:
as the perpetrator is older the more likely is for him/her to kill the victim.
if the perpetrator is Asian or White it is more likely for him/her to kill the victim. Instead if the perpetrator is Black or Hispanic the murder probability doesn’t change much by watching the confidence bands.
Question (6): Is there a trend with respect to the date on which the incident occurred?
The model only give information about the year as the other date predictors were removed by step function. Unfortunately the model summary and the confidence bands of year effect tells us that this predictor is not significant.
Question (7): Is survival less likely in late hours?
The model tells us that during the early morning and the morning it is less likely to survive a shooting incident. However the confidence band are very large thus the probability of murder could not change much.
Question (8): Is it more likely to survive on a weekday?
The model doesn’t give information about the weekday as the associated predictor was removed by step function. Thus the probability of murder is the same across weekdays according to this model.
Question (9): What happened during the pandemic period?
The model doesn’t give information about the pandemic period as the associated predictors were removed by step function. Thus the probability of murder is the same even during COVID according to this model.
Furthermore the models gives us addiction information:
if in the shooting incident other victims are present then the probability of murder greatly increases.
if the jurisdiction is patrol the probability of murder increases.
if the Latitude increases the probability of murder increases. The model summary tells us that this predictor is slightly significant.